Learned Vector-Space Models for Document Retrieval

Authors

  • William R. Caid
  • Susan T. Dumais
  • Stephen I. Gallant
Abstract

Bellcore's Latent Semantic Indexing (LSI) system and HNC's MatchPlus system represent two attempts to model and exploit the inter-relationships among terms to improve information retrieval. Most information retrieval methods depend on exact matches between words in users' queries and words in documents. Typically, documents containing one or more query words are returned to the user. Such methods will, however, fail to retrieve relevant materials that do not share words with users' queries. One reason for this is that the standard retrieval models (e.g., Boolean, standard vector, probabilistic) treat words as if they are pairwise orthogonal or independent, although it is quite obvious that they are not. Consider, for example, the terms "automobile", "car", "driver", and "elephant". The terms "automobile" and "car" are synonyms, "driver" is a related concept, and "elephant" is pretty much unrelated. In most retrieval systems the query "automobile" is no more likely to retrieve an article about cars than one about elephants, if neither author uses precisely the term automobile. It would be preferable, however, if a query about automobiles would retrieve articles about cars and even articles about drivers to a lesser extent. A central theme of both the LSI and MatchPlus methods is that term-term inter-relationships like these can be explicitly modeled in the representation and automatically used to improve retrieval.
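The automobile/car/elephant example can be illustrated with a minimal sketch. It assumes a made-up five-document toy corpus and uses a plain truncated SVD from NumPy as a stand-in for an LSI-style latent space; it is not the Bellcore LSI or HNC MatchPlus implementation, and the names `term_vector`, `cosine`, and the choice `k = 2` are hypothetical.

```python
import numpy as np

# Toy vocabulary and corpus (illustrative assumptions, not data from the paper).
terms = ["automobile", "car", "driver", "elephant"]
docs = [
    "automobile driver",  # 0: uses "automobile"
    "car driver",         # 1: links "car" to "driver"
    "car",                # 2: about cars, never says "automobile"
    "elephant",           # 3: unrelated topic
    "elephant",           # 4: unrelated topic
]

def term_vector(text):
    """Raw term-count vector over the toy vocabulary."""
    words = text.split()
    return np.array([words.count(t) for t in terms], dtype=float)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Term-by-document matrix: one column per document.
A = np.column_stack([term_vector(d) for d in docs])
q = term_vector("automobile")

# Exact-match (standard vector) model: terms are treated as orthogonal axes.
exact = [cosine(q, A[:, j]) for j in range(A.shape[1])]

# Latent space: project query and documents onto the top-k left singular vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
q_k = U_k.T @ q
docs_k = U_k.T @ A
latent = [cosine(q_k, docs_k[:, j]) for j in range(A.shape[1])]

for j, d in enumerate(docs):
    print(f"{d!r:22} exact={exact[j]:.2f} latent={latent[j]:.2f}")
```

In the exact-match model the query "automobile" scores zero against both the car-only document and the elephant documents, so it cannot tell them apart. In the reduced space, the shared co-occurrence with "driver" pulls "automobile" and "car" onto the same latent dimension, so the car-only document scores well above the elephant documents. A corpus this small exaggerates the effect; it only shows the mechanism.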

Similar resources

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Probability Bracket Notation, Term Vector Space, Concept Fock Space and Induced Probabilistic IR Models

After a brief introduction to Probability Bracket Notation (PBN) for discrete random variables in time-independent probability spaces, we apply both PBN and Dirac notation to investigate probabilistic modeling for information retrieval (IR). We derive the expressions of relevance of document to query (RDQ) for various probabilistic models, induced by Term Vector Space (TVS) and by Concept Fock ...

An Artificial Intelligence Approach To Information Retrieval

Document structure weighting is a technique whereby different parts of a document (title, abstract, etc.) contribute unevenly to the overall document weight during ranking. Near optimal weights can be learned with a GA. Doing so shows a statistically significant 5% relative improvement in MAP for vector space inner product and Croft’s probabilistic ranking, but no improvement for BM25. Two appl...

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval

In this lecture, we will introduce our second paradigm for document retrieval: probabilistic retrieval. We will focus on Robertson and Spärck Jones’ 1976 version, presented in the paper Relevance Weighting of Search Terms. This was an influential paper that was published when the Vector Space Model was first being developed; it is important to keep in mind the differences and similarities betw...

Journal:
  • Inf. Process. Manage.

Volume: 31

Publication date: 1995